Read in the data file sent to you. It was downloaded from Kaggle and comes with the following information:
Mobile App Statistics (Apple iOS app store)
The ever-changing mobile landscape is a challenging space to navigate. The percentage of mobile over desktop is only increasing. Android holds about 53.2% of the smartphone market, while iOS is 43%. To get more people to download your app, you need to make sure they can easily find your app. Mobile app analytics is a great way to understand the existing strategy to drive growth and retention of future users.
With millions of apps around nowadays, the following data set has become key to identifying top trending apps in the iOS app store. This data set contains details on more than 7000 Apple iOS mobile applications. The data was extracted from the iTunes Search API at the Apple Inc. website. R and Linux web scraping tools were used for this study.
Dimensions of the data set: 7197 rows and 16 columns
The contents include:
There may be an errant column that gets read-in, likely named X; feel free to NULL this.
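A minimal sketch of where a stray X column can come from and how to drop it. The data frame `toy_apps` and its columns are made up for illustration; the real file name and contents are not shown here. The sketch assumes the file was written with row names, which is one common cause of this column.

```r
# Toy stand-in for the Kaggle file (hypothetical columns, not the real data)
toy_apps <- data.frame(track_name = c("A", "B"), price = c(0, 1.99))

# write.csv() keeps row names by default, so an unnamed index column is written
tmp <- tempfile(fileext = ".csv")
write.csv(toy_apps, tmp)

# On read-in, the unnamed column is given the name "X"
toy_apps <- read.csv(tmp)
names(toy_apps)        # "X" "track_name" "price"

# Assigning NULL to a column removes it
toy_apps$X <- NULL
names(toy_apps)        # "track_name" "price"
```

With the real data the same idea would be `AppleStore$X <- NULL`, assuming the data frame is named AppleStore as in the code that follows.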
Draw an appropriate plot that shows the distribution of apps’ content rating. Make sure you title the plot, label the x-axis and y-axis. Most apps carry what content-rating?
library(ggplot2)
library(hrbrthemes)
library(ggthemes)
library(dplyr)
ratings <- AppleStore %>% mutate(content = ordered(cont_rating,
levels = c("4+", "9+", "12+", "17+")))
ggplot(ratings, aes(content, fill = content)) +
geom_bar() + labs(title = "Content Ratings of Apps in Apple's App Store",
x = "Content Rating", y = "Frequency") +
theme_ipsum_tw() + theme(legend.position = "none")
Draw an appropriate plot that shows the distribution of apps’ primary genre. Make sure you title the plot, label the x-axis and y-axis. What three primary genres have the most apps? What primary genre has the fewest apps?
ggplot(AppleStore, aes(prime_genre, fill = prime_genre)) +
geom_bar() + coord_flip() + labs(title = "Primary Genre of Apps in Apple's App Store",
x = "Genre", y = "Frequency") + theme_ipsum_tw() +
theme(legend.position = "none")
Games, Entertainment, and Education have the most apps, while Catalogs, Medical, and Navigation appear to have the fewest apps.
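Reading the largest and smallest bars off a chart can be confirmed with a sorted frequency table. The sketch below uses a made-up `genre` vector, not the real App Store counts.

```r
# Toy genre vector (made up for illustration)
genre <- c(rep("Games", 4), rep("Entertainment", 3),
           rep("Education", 2), "Catalogs")

# table() counts each category; sort() orders the counts, largest first
genre_counts <- sort(table(genre), decreasing = TRUE)

head(names(genre_counts), 3)   # the three most common genres
tail(names(genre_counts), 1)   # the least common genre
```

With the real data, the analogous call would be `sort(table(AppleStore$prime_genre), decreasing = TRUE)`.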
Subset the data so that you only have apps with price less than 100. Then use an appropriate plot to show the distribution of price. Is price distributed symmetrically or is it skewed? If skewed, is the skew positive or negative?
ratings2 = subset(ratings, price < 100)
ggplot(ratings2, aes(x = "", price)) + geom_boxplot(fill = "moccasin") +
coord_flip() + labs(title = "Price of Apps in Apple's App Store",
subtitle = "(Apps capped at price < $100)",
y = "Price", x = "") + theme_ipsum_rc()
There is extreme positive skew, with several outliers.
Or, alternatively
ggplot(ratings2, aes(x = price)) + geom_histogram(fill = "moccasin") +
labs(title = "Price of Apps in Apple's App Store",
subtitle = "(Apps capped at price < $100)",
x = "Price", y = "Frequency") + theme_ipsum_rc()
Does this distribution look different if you disaggregate your plot in #4 by apps’ content rating?
ggplot(ratings2, aes(x = "", y = price, fill = content)) +
geom_boxplot() + coord_flip() + facet_wrap(~content) +
labs(title = "Price of Apps in Apple's App Store",
subtitle = "(Apps capped at price < $100)",
y = "Price", x = "") + theme_ipsum_rc() +
theme(legend.position = "none")
No, the distributions remain heavily positively skewed with outliers in each.
This is not a bad plot but we could do better …
ggplot(ratings2, aes(x = content, y = price,
fill = content)) + geom_boxplot() + coord_flip() +
facet_wrap(~content, ncol = 1) + labs(title = "Price of Apps in Apple's App Store",
subtitle = "(Apps capped at price < $100)",
y = "Price", x = "Content Rating") +
theme_ipsum_rc() + theme(legend.position = "none")
Or alternatively
library(ggridges)
ggplot(ratings2, aes(x = price, y = content,
fill = content)) + geom_density_ridges() +
labs(title = "Price of Apps in Apple's App Store",
subtitle = "(Apps capped at price < $100)",
x = "Price", y = "Content Rating") +
theme_ipsum_rc() + theme(legend.position = "none")
Load the EPA data used for Task 03. Create a data-set of the number of cars of each make by year. Then restrict this data-set to only include the following makes: “Ford”, “General Motors”, “GMC”, “Chevrolet”, “Dodge”, “Pontiac”, “Honda”, “Mazda”, “Toyota”, “Subaru”, “BMW”, “Mercedes-Benz”. Now draw an appropriate plot that shows the number of cars per make and year. Use colors to distinguish between makes.
load("~/Documents/Teaching/mpa5830/data/epa.RData")
library(dplyr)
epa2 <- epa %>% select(city08, drive, fuelType1,
highway08, make, year)
tab01 <- epa2 %>% group_by(make, year) %>%
summarise(Frequency = n()) %>% filter(make %in%
c("Ford", "General Motors", "GMC", "Chevrolet",
"Dodge", "Pontiac", "Honda", "Mazda",
"Toyota", "Subaru", "BMW", "Mercedes-Benz"),
year < 2019)
ggplot(tab01, aes(x = year, y = Frequency,
group = make, color = make)) + geom_line() +
labs(x = "Year", y = "Number of Models",
title = "Number of Car Models", subtitle = "(by Manufacturer and Year)",
caption = "Source: U.S. Environmental Protection Agency",
color = "") + scale_color_ptol() +
theme_ipsum_rc() + theme(legend.position = "bottom")
You could also do …
ggplot(tab01, aes(x = year, y = Frequency,
group = make, fill = make)) + geom_bar(stat = "identity") +
facet_wrap(~make) + labs(x = "Year",
y = "Number of Models", title = "Number of Car Models",
subtitle = "(by Manufacturer and Year)",
caption = "Source: U.S. Environmental Protection Agency",
fill = "") + scale_fill_ptol() + theme_ipsum_rc() +
theme(legend.position = "bottom")
If you only filter(), without summarising first, geom_bar() does the counting for you and you would have
epa3 <- epa %>% select(city08, drive, fuelType1,
highway08, make, year) %>% filter(make %in%
c("Ford", "General Motors", "GMC", "Chevrolet",
"Dodge", "Pontiac", "Honda", "Mazda",
"Toyota", "Subaru", "BMW", "Mercedes-Benz"),
year < 2019)
ggplot(epa3, aes(x = year, group = make,
fill = make)) + geom_bar() + facet_wrap(~make) +
labs(x = "Year", y = "Number of Models",
title = "Number of Car Models", subtitle = "(by Manufacturer and Year)",
caption = "Source: U.S. Environmental Protection Agency",
fill = "") + scale_fill_ptol() +
theme_ipsum_rc() + theme(legend.position = "bottom")
Use an appropriate plot to explore the relationship between highway08 and city08 miles per gallon by year. Are the two related positively or negatively? Does the relationship appear to be weak or strong?
library(viridis)
ggplot(epa2, aes(x = highway08, y = city08,
color = year)) + geom_point() + labs(x = "Highway miles per gallon",
y = "City miles per gallon", title = "City vs. Highway miles per gallon",
subtitle = "(by Year)", caption = "Source: U.S. Environmental Protection Agency",
color = "") + scale_color_viridis_c(option = "viridis",
direction = 1) + theme_ipsum_rc() + theme(legend.position = "bottom")
The two appear to be positively correlated and the relationship appears to be a strong one, especially so for recent years.
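The visual impression of direction and strength can be backed up with a correlation coefficient. This is a small sketch on made-up mileage values (`highway` and `city` are hypothetical vectors, not the real epa data).

```r
# Toy mileage values (made up for illustration)
highway <- c(20, 24, 28, 33, 40)
city    <- c(15, 18, 21, 25, 31)

# cor() returns the Pearson correlation by default:
# the sign gives the direction, values near 1 or -1 indicate a strong linear relationship
r <- cor(highway, city)
r > 0
```

With the real data, something like `cor(epa2$highway08, epa2$city08, use = "complete.obs")` would apply, where `use = "complete.obs"` skips rows with missing values.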
Does the preceding relationship differ by fuelType1?
ggplot(epa2, aes(x = highway08, y = city08,
colour = year)) + geom_point() + facet_wrap(~fuelType1) +
labs(x = "Highway miles per gallon",
y = "City miles per gallon", title = "City vs. Highway miles per gallon",
subtitle = "(by Fuel Type)", caption = "Source: U.S. Environmental Protection Agency",
color = "") + scale_color_viridis_c(option = "viridis",
direction = 1) + theme_ipsum_rc() + theme(legend.position = "bottom")
No, the two miles-per-gallon measures still appear to be strongly related.
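A faceted plot compares groups by eye; the same comparison can be made numerically by computing one correlation per group. The data frame `toy` below is made up for illustration, and its column names are hypothetical.

```r
# Toy data with two fuel types (made up for illustration)
toy <- data.frame(
  fuel    = c("Gas", "Gas", "Gas", "Diesel", "Diesel", "Diesel"),
  highway = c(22, 28, 35, 30, 38, 45),
  city    = c(16, 21, 27, 24, 30, 36)
)

# split() breaks the data frame into one piece per fuel type;
# sapply() then computes the correlation within each piece
by_fuel <- sapply(split(toy, toy$fuel), function(d) cor(d$highway, d$city))
by_fuel
```

Applied to the real data, the grouping variable would be `epa2$fuelType1` and the columns `highway08` and `city08`.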
Does it differ by drive?
ggplot(epa2, aes(x = highway08, y = city08,
colour = year)) + geom_point() + facet_wrap(~drive) +
labs(x = "Highway miles per gallon",
y = "City miles per gallon", title = "City vs. Highway miles per gallon",
subtitle = "(by Drive-type)", caption = "Source: U.S. Environmental Protection Agency",
color = "") + scale_color_viridis_c(option = "viridis",
direction = 1) + theme_ipsum_rc() + theme(legend.position = "bottom")
While the two miles-per-gallon measures still appear to be strongly related, we see a stronger relationship for All-Wheel Drive, Front-Wheel Drive, and Rear-Wheel Drive vehicles.
If you wanted to eliminate the blank panel, which you should since it is non-informative, you could do this.
epa3 <- epa2 %>% filter(drive != "")
ggplot(epa3, aes(x = highway08, y = city08,
color = year)) + geom_point() + facet_wrap(~drive,
ncol = 2) + labs(x = "Highway miles per gallon",
y = "City miles per gallon", title = "City vs. Highway miles per gallon",
subtitle = "(by Drive-type)", caption = "Source: U.S. Environmental Protection Agency",
color = "") + scale_color_viridis_c(option = "viridis",
direction = 1) + theme_ipsum_rc() + theme(legend.position = "bottom")
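Before dropping the blank category, it can help to check how many rows it covers. A minimal sketch on a made-up `drive` vector (the real counts are not shown here):

```r
# Toy drive-type vector with some blank entries (made up for illustration)
drive <- c("Front-Wheel Drive", "", "Rear-Wheel Drive", "", "All-Wheel Drive")

# Number of rows that the blank filter would remove
n_blank <- sum(drive == "")

# Keep only the non-blank entries
kept <- drive[drive != ""]
length(kept)
```

The same check on the real data would be `sum(epa2$drive == "")`.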